On the Comparison of Semi-Supervised Hierarchical Clustering Algorithms in Text Mining Tasks
Abstract
Semi-supervised clustering approaches have emerged as an option for enhancing clustering results. These algorithms use external information to guide the clustering process. In particular, semi-supervised hierarchical clustering approaches have been explored in many fields in recent years. These algorithms provide efficient and personalized hierarchical overviews of datasets. To the best of the authors' knowledge, however, semi-supervised hierarchical clustering algorithms have not been extensively tested in document clustering. In this paper, we compare the performance of three state-of-the-art semi-supervised hierarchical clustering algorithms in clustering document collections. These algorithms incorporate the external information into the clustering process in different ways: CCL, which uses cluster-level constraints based on the Complete-Link approach; HCAC, which uses cluster-level constraints based on the Average-Link approach; and Pairwise-Constrained, which uses instance-level constraints based on the Average-Link approach. Experimental results indicate that using cluster-level or instance-level constraints achieves a significant improvement in clustering quality compared with unsupervised approaches. In particular, HCAC outperforms the other methods with statistical significance.
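To make the idea of instance-level constraints concrete, the following is a minimal sketch of average-link agglomerative clustering guided by must-link and cannot-link pairs. It is a generic illustration only: it is not CCL, HCAC, or the Pairwise-Constrained algorithm compared in the paper, and the function names, the Euclidean distance, and the way constraints are enforced are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def average_link(a, b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def constrained_average_link(points, n_clusters, must_link=(), cannot_link=()):
    """Average-link agglomerative clustering with instance-level constraints.

    Hypothetical helper for illustration; not an algorithm from the paper.
    Returns a list of sets of point indices.
    """
    # Start with one singleton cluster per point.
    clusters = [{i} for i in range(len(points))]

    # Honour must-link pairs by merging their clusters up front.
    for i, j in must_link:
        ci = next(c for c in clusters if i in c)
        cj = next(c for c in clusters if j in c)
        if ci is not cj:
            clusters.remove(cj)
            ci |= cj

    def violates(a, b):
        # A merge is forbidden if it would join a cannot-link pair.
        return any((i in a and j in b) or (i in b and j in a) for i, j in cannot_link)

    while len(clusters) > n_clusters:
        # Score every allowed merge by its average-link distance.
        candidates = [
            (average_link(a, b, points), x, y)
            for (x, a), (y, b) in combinations(enumerate(clusters), 2)
            if not violates(a, b)
        ]
        if not candidates:
            break  # every remaining merge would break a constraint
        _, x, y = min(candidates)
        clusters[x] |= clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters
```

In this sketch the constraints are hard: a cannot-link pair blocks a merge outright, so the loop may stop with more than `n_clusters` clusters. Cluster-level constraints, as used by CCL and HCAC, operate on groups of instances and are not modeled here.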
Similar resources
Semi-supervised Hierarchical Clustering Analysis for High Dimensional Data
In many data mining tasks, there is a large supply of unlabeled data but limited labeled data, since labeled data is expensive to generate. Therefore, a number of semi-supervised clustering algorithms have been proposed, but few of them are specifically designed for high dimensional data. High dimensionality is a difficult challenge for clustering analysis due to the inherent sparse distribution, and most of p...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Concurrent Semi-supervised Learning of Data Streams
Conventional stream mining algorithms focus on single and stand-alone mining tasks. Given the single-pass nature of data streams, it makes sense to maximize throughput by performing multiple complementary mining tasks concurrently. We investigate the potential of concurrent semi-supervised learning on data streams and propose an incremental algorithm called CSL-Stream (Concurrent Semi–supervise...
Using Supervised Clustering Technique to Classify Received Messages in 137 Call Center of Tehran City Council
Supervised clustering is a data mining technique that assigns a set of data to predefined classes by analyzing dataset attributes. It is considered an important technique for information retrieval, management, and mining in information systems. Since customer satisfaction is the main goal of organizations in modern society, to meet the requirements, 137 call center of Tehran city council is ...